Analyzing CUDA’s Compiler through the Visualization of Decoded GPU Binaries

Authors

  • Cedric Nugteren
  • Bart Mesman
  • Henk Corporaal
Abstract

As GPU architectures grow in importance thanks to their large number of parallel processors, NVIDIA’s CUDA environment is becoming widely used for general-purpose applications. To exploit this parallel processing power, programmers must parallelize their algorithms and map them onto the hardware efficiently. The difficulty of this task motivates an investigation of CUDA’s compiler. Part of the compiler in the CUDA tool-chain is entirely undocumented, as is its output. To draw conclusions about the behaviour of this compiler, the resulting object code is reverse engineered. A visualization tool is introduced that analyzes the previously unknown compiler behaviour and helps the programmer improve the mapping process. The improvements focus on register allocation and instruction reordering. This paper describes an extension to the CUDA tool-chain that provides programmers with a visualization of register life ranges. It also presents guidelines describing how to apply optimizations to obtain lower register pressure. In one case study, performance increases by 33% compared to already optimized CUDA code, a gain achieved by optimizing the code with the help of the introduced visualization tool. In 11 other case studies, register pressure is reduced by an average of 18%. The presented guidelines could be added to the compiler so that a similar reduction in register pressure is achieved automatically at compile time for new and existing CUDA programs.
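To make the terms "register life range" and "register pressure" concrete, the following is a minimal Python sketch. It is not the tool described in the paper, and it does not decode real GPU binaries. The instruction format (a destination register plus a list of source registers), the register names, and the helper names `live_ranges`, `pressure`, and `render` are all assumptions made for illustration. The sketch also simplifies heavily: it handles straight-line code only and treats each register as live from its first appearance to its last.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical instruction format: (destination register or None, source registers).
Instr = Tuple[Optional[str], List[str]]


def live_ranges(instrs: List[Instr]) -> Dict[str, Tuple[int, int]]:
    """Map each register to (first index, last index) at which it appears."""
    ranges: Dict[str, Tuple[int, int]] = {}
    for i, (dest, srcs) in enumerate(instrs):
        regs = list(srcs) + ([dest] if dest is not None else [])
        for reg in regs:
            start = ranges[reg][0] if reg in ranges else i
            ranges[reg] = (start, i)
    return ranges


def pressure(instrs: List[Instr]) -> List[int]:
    """Count how many registers are live at each instruction index."""
    ranges = live_ranges(instrs)
    return [
        sum(1 for (start, end) in ranges.values() if start <= i <= end)
        for i in range(len(instrs))
    ]


def render(instrs: List[Instr]) -> List[str]:
    """Draw one text row per register: '#' where it is live, '.' elsewhere."""
    ranges = live_ranges(instrs)
    rows = []
    for reg in sorted(ranges):
        start, end = ranges[reg]
        bar = "".join("#" if start <= i <= end else "." for i in range(len(instrs)))
        rows.append(f"{reg:>3} {bar}")
    return rows


# Original order: all three loads are issued before any arithmetic.
program: List[Instr] = [
    ("r0", []),            # r0 = load
    ("r1", []),            # r1 = load
    ("r2", []),            # r2 = load
    ("r3", ["r0", "r1"]),  # r3 = r0 + r1
    ("r4", ["r3", "r2"]),  # r4 = r3 * r2
    (None, ["r4"]),        # store r4
]

# Reordered: the load of r2 is delayed until just before its first use.
reordered: List[Instr] = [
    ("r0", []),
    ("r1", []),
    ("r3", ["r0", "r1"]),
    ("r2", []),
    ("r4", ["r3", "r2"]),
    (None, ["r4"]),
]

if __name__ == "__main__":
    for name, prog in (("original", program), ("reordered", reordered)):
        print(name, "peak pressure:", max(pressure(prog)))
        print("\n".join(render(prog)))
```

In this toy example, delaying one load shortens the life range of `r2` and lowers the peak number of simultaneously live registers. This is the kind of effect that instruction reordering can have on register pressure, although a real compiler must also weigh memory latency and other constraints.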

Similar resources

Neural Nets Can Learn Function Type Signatures From Binaries

Function type signatures are important for binary analysis, but they are not available in COTS binaries. In this paper, we present a new system called EKLAVYA which trains a recurrent neural network to recover function type signatures from disassembled binary code. EKLAVYA assumes no knowledge of the target instruction set semantics to make such inference. More importantly, EKLAVYA results are ...

Translating GPU Binaries to Tiered SIMD Architectures with Ocelot

Parallel Thread Execution ISA (PTX) is a virtual instruction set used by NVIDIA GPUs that explicitly expresses hierarchical MIMD and SIMD style parallelism in an application. In such a programming model, the programmer and compiler are left with the not trivial, but not impossible, task of composing applications from parallel algorithms and data structures. Once this has been accomplished, even...

Efficient asynchronous executions of AMR computations and visualization on a GPU system

Adaptive Mesh Refinement is a method that dynamically varies the spatio-temporal resolution of localized mesh regions in numerical simulations, based on the strength of the solution features. In-situ visualization plays an important role in analyzing the time-evolving characteristics of the domain structures. Continuous visualization of the output data for various timesteps results in a better...

Simty: generalized SIMT execution on RISC-V

We present Simty, a massively multi-threaded RISC-V processor core that acts as a proof of concept for dynamic inter-thread vectorization at the micro-architecture level. Simty runs groups of scalar threads executing SPMD code in lockstep, and assembles SIMD instructions dynamically across threads. Unlike existing SIMD or SIMT processors like GPUs or vector processors, Simty vectorizes scalar g...

Publication date: 2010